April 1, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)
Success:
Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!
Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!
Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?
All methods we’ll talk about can be sampled!
Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)
A collection of images
A collection of sentences
The easiest way to approach this is with a common probability identity.
Let \(\mathbf x\) be a vector of inputs - \([x_1, x_2, \ldots, x_P]\)
The joint density over inputs is then:
\[ P(\mathbf x) = f(x_1,x_2,x_3,...,x_P) \]
The probability chain rule:
\[ f(x_1,x_2,x_3,...,x_P) = f(x_1)f(x_2 | x_1)f(x_3 | x_1,x_2)...f(x_P | x_{P-1}, x_{P-2},...) \]
If it is possible to learn the conditional density of the next input given the previous inputs, then we’ve learned the joint density of the inputs!
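As a toy illustration (synthetic numbers, numpy only), the chain rule can be checked directly on a small discrete joint distribution, where every conditional is available by summing out the table:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy joint P(x1, x2, x3) over three binary "pixels", stored as a 2x2x2 table.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

def chain_rule_prob(x1, x2, x3):
    """f(x1) * f(x2 | x1) * f(x3 | x1, x2), each factor read off the table."""
    p_x1 = joint.sum(axis=(1, 2))[x1]
    p_x2_given = joint.sum(axis=2)[x1, x2] / joint.sum(axis=(1, 2))[x1]
    p_x3_given = joint[x1, x2, x3] / joint.sum(axis=2)[x1, x2]
    return p_x1 * p_x2_given * p_x3_given

# The product of conditionals recovers the joint probability exactly.
assert np.isclose(chain_rule_prob(1, 0, 1), joint[1, 0, 1])
```

Autoregressive models replace these lookup tables with a learned network for each conditional, but the factorization is identical.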
Advancements in recurrent image generation:
Pixel CNN - exchange recurrence for CNN style windowing; a little faster
ImageGPT - exchange recurrence for masked self-attention; a little faster with really large training sets
These do a great job when we can compute them in finite time!
Painfully slow!!!
ImageGPT has only been shown to work for up to 64 x 64 input images
Literal days of training time to learn joint densities for small-ish data sets
Can’t be used generally to create a dictionary of all images!
What autoregressive methods do well:
Explicit density estimation - given a starting prompt/word, we can compute the joint probability of the next word directly
Fast at generation (for text)
What autoregressive methods don’t do well:
Images (too high dimensional)
No feature representation
What can we do to generate images?
The rest of the class will be on image/video generation!
These methods can be used for text (or other 1d sequential data like sound waves)
Autoregressives are just the dominant generation method at this point in time!
Also, very useful for descriptive purposes!
A classic in the unsupervised literature is principal components analysis.
A more general statement:
Given a corpus of inputs, find a low dimensional representation of the inputs that minimizes reconstruction error
\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - d(e(\mathbf x_i))\|^2_2 \]
\(e(\mathbf x_i) = \mathbf z_i\) is an encoder function that maps the input to a latent space
\(d(\mathbf z_i)\) is a decoder function that maps the latent variable back to original feature space
How we choose to find \(e()\) and \(d()\) dictates the method
PCA
Given a \(N \times P\) matrix of input features, find a single weight matrix, \(\mathbf W\), with column rank \(K\) that minimizes the objective function:
\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - \mathbf W \mathbf W^T \mathbf x_i\|^2_2 \]
If the input features are correlated, then we don’t need to see them all individually to reconstruct the input
Find a lower dimensional representation of the input vector that retains most of the information
\(\mathbf W\) is a \(P \times K\) matrix of weights where \(K << P\) that controls both the mapping of the input to the latent space and the mapping of the latent space back to the input space
\(\mathbf z_i\) is a \(K\)-dimensional latent representation of the input that is derived through a linear mapping - \(\mathbf W^T \mathbf x_i\)
\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - \mathbf W \mathbf z_i \|^2_2 \]
Two additional constraints:
The columns of \(\mathbf W\) should be orthogonal directions
The columns of \(\mathbf W\) should be normalized
PCA operates by using the eigenvalues and eigenvectors of the square covariance matrix for the centered feature matrix:
\[ \mathbf X^T \mathbf X = \mathbf Q \mathbf D \mathbf Q^{-1} \]
where \(\mathbf Q\) is the collection of \(P\) eigenvectors ordered by eigenvalue (from large to small)
To create a rank \(K\) approximation of the original feature matrix, set \(\mathbf W = \mathbf Q_{1:K}\)
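On synthetic data, the recipe above is a few lines of numpy (a sketch with made-up inputs, not a library-grade implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, K = 500, 10, 3

# Correlated synthetic features built from K underlying factors, then centered.
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, P)) + 0.1 * rng.normal(size=(N, P))
X = X - X.mean(axis=0)

# Eigendecomposition of X^T X; eigh returns eigenvalues in ascending order,
# so re-sort from large to small before taking the first K eigenvectors.
evals, Q = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]
W = Q[:, order[:K]]          # W = Q_{1:K}, a P x K matrix

Z = X @ W                    # latent scores, N x K
X_hat = Z @ W.T              # rank-K reconstruction

recon_err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
```

Because the data were generated from K factors plus a little noise, the rank-K reconstruction error comes out as a small fraction of the total variance.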
What’s neat about PCA is that it produces the best orthogonal rank-K approximation to the input under linearity and squared reconstruction error
No other method (under these constraints) will do better!
This proof can be found here
Why this construction?
It admits a nice analytical solution with good properties.
Can be solved via a basic computer that can find eigenvectors
Computational constraints led to the solution - no other particularly good reason!
Using PCA, we can:
Get rid of redundant features
Reduce dimensionality
Parse away noise
Pass latent scores, \(\mathbf Z\), to supervised learning procedures
Additionally, the dimensions of the latent space correspond to factors of variation within the training set!
A key component of the data scientist’s toolkit!
Problem: Linearity and strictly invertible mappings are restrictive
Think of PCA as creating a mapping from the feature space to a latent space and then inverting the mapping to return an approximation to the original feature space
PCA seeks to solve the generic problem of reconstruction:
\[ \|\mathbf X - d(e(\mathbf X))\|^2_2 \]
This is referred to as a deterministic bottleneck autoencoder
Learn a set of encoder and decoder functions that map \(\mathbf X\) to itself!
Restriction: each input instance passes through a low-dimensional bottleneck ( \(K << P\) )
Can’t just learn the identity map, so the network needs to set up \(\mathbf Z\) to represent as much of the variation in \(\mathbf X\) as possible
Note that PCA is a special case!
No good reason to restrict ourselves to linear maps when we have autodiff that can fit arbitrarily complex encoder/decoder models
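As a sketch of that point, here is a linear bottleneck autoencoder fit by plain gradient descent on synthetic data (gradients written out by hand rather than via autodiff, to stay numpy-only; the learning rate and step count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, K = 200, 8, 2

# Synthetic rank-K data so a K-dimensional bottleneck can reconstruct it well.
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, P))
X = X - X.mean(axis=0)

# Encoder E (P x K) and decoder D (K x P), trained on mean squared
# reconstruction error || X - X E D ||^2.
E = 0.1 * rng.normal(size=(P, K))
D = 0.1 * rng.normal(size=(K, P))
lr = 0.01

losses = []
for _ in range(2000):
    R = X - X @ E @ D                      # residuals, N x P
    losses.append(np.mean(np.sum(R ** 2, axis=1)))
    grad_E = -(2 / N) * X.T @ R @ D.T      # d(mean loss) / dE
    grad_D = -(2 / N) * E.T @ X.T @ R      # d(mean loss) / dD
    E -= lr * grad_E
    D -= lr * grad_D
```

With no nonlinearities this recovers the same subspace PCA finds (though not the same orthonormal basis); swapping the linear maps for deep networks changes nothing about the objective.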
How do you choose the size of the bottleneck?
This largely echoes the same discussion for PCA
Just choose a small number and see how well we’re able to recover the images
Stop adding dimensions when the next one doesn’t reduce the loss too much
Use validation error to choose - typically not done since it would require serious tuning.
Common choices: 2 (handy for visualization) or a power of two like 16/32/64/128 for images.
A few notes:
Everyone understands PCA, not necessarily autoencoders. Follow disciplinary norms.
For images, the CNN backbone helps to induce structure in the recovered images! The reconstruction loss is about the same as a deep feedforward autoencoder in a lot of situations, but it helps to find structure for recovery.
For images, autoencoders strictly dominate PCA!
What can we do with deterministic autoencoders?
Like transformers, the encoder and decoder can be split apart for different tasks.
Encoder ( \(\mathbf X \rightarrow \mathbf Z\) )
Use as you would for PCA. Pass lower dimensional representation to supervised model for better predictive performance.
Use to examine feature maps to understand sources of shared variation within input features.
The decoder is the interesting part!
Decoder ( \(\mathbf Z \rightarrow \mathbf X\) ):
In theory, \(\mathbf Z\) shares most of the same information as \(\mathbf X\).
The lower dimensional representation is more manageable to work with.
The decoder dictates a mapping from \(\mathbf Z\) to \(\mathbf X\)
Any thoughts?
Given an input \(\mathbf z\), we could use that to translate back up to the feature space!
How do we choose \(\mathbf z\) to be viable?
Random dart throws?
Pick a point in the convex hull of the latent space defined by the training data?
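One concrete property worth noting: for a linear decoder (PCA's \(\mathbf W\) is used here as a stand-in, on synthetic data), decoding a point between two training codes gives exactly the point between their reconstructions, so convex combinations in the latent space are at least well behaved:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P, K = 300, 6, 2
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, P))
X = X - X.mean(axis=0)

evals, Q = np.linalg.eigh(X.T @ X)
W = Q[:, np.argsort(evals)[::-1][:K]]   # PCA weights, shared by encoder/decoder

Z = X @ W                               # latent codes for the training set

# A point in the convex hull of two training codes, decoded to feature space.
z_new = 0.5 * Z[0] + 0.5 * Z[1]
x_new = W @ z_new

# Linearity: decoding the midpoint equals the midpoint of the decodings.
assert np.allclose(x_new, 0.5 * W @ Z[0] + 0.5 * W @ Z[1])
```

A deep decoder offers no such guarantee, which is part of why random dart throws in the latent space rarely produce viable images.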
Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)
Success:
Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!
Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!
Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?
This approach doesn’t generate viable images
This isn’t a proper probabilistic model!
There isn’t ever an expression of \(P(\mathbf X)\)
The method can’t learn how to fill in the gaps between digits appropriately because we never asked it to do that!
We can’t say the probability with which we would expect to see a data point given the data.
We can’t sample from the resulting approximation to \(P(\mathbf X)\) because there isn’t a distribution.
A probabilistic version of PCA (sort of kinda) is called factor analysis
The construction is pretty similar - \(P\) dimensional feature space to \(K\) dimensional latent space, \(\mathbf Z\)
The difference is that we give structure to the latent space by making a prior assumption
Not that consequential since the latent space is to be learned from the data.
Just determines what the latent space will look like.
Most common version - Gaussian factor analysis
\[ f(\mathbf z) = \mathcal N_K(\mathbf z | \mathbf 0 , \mathcal I_K) \]
\[ f(\mathbf x | \mathbf z , \boldsymbol \Theta) = \mathcal N_P(\mathbf x | \mathbf W \mathbf z + \boldsymbol \alpha , \boldsymbol \Psi) \]
where \(\mathbf W\) is a \(P \times K\) matrix of weights, \(\mathbf z\) is a \(K\)-vector of latent values, \(\boldsymbol \alpha\) is an intercept term, and \(\boldsymbol \Psi\) is a \(P \times P\) covariance matrix.
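This model is easy to sample from. A sketch with arbitrary made-up parameter values (and a diagonal \(\boldsymbol \Psi\), the usual factor-analysis choice):

```python
import numpy as np

rng = np.random.default_rng(4)
P, K = 5, 2

# Made-up parameters for illustration.
W = rng.normal(size=(P, K))
alpha = rng.normal(size=P)
Psi = np.diag(rng.uniform(0.1, 0.5, size=P))   # diagonal noise covariance

def sample_x(n):
    z = rng.normal(size=(n, K))                              # z ~ N(0, I_K)
    eps = rng.multivariate_normal(np.zeros(P), Psi, size=n)  # noise ~ N(0, Psi)
    return z @ W.T + alpha + eps                             # x | z ~ N(Wz + alpha, Psi)

X = sample_x(100_000)

# Marginally, x ~ N(alpha, W W^T + Psi); the sample moments should agree.
implied_cov = W @ W.T + Psi
```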
What makes a model generative?
We should be able to provide an answer to the question:
\[ P(\mathbf x | \boldsymbol \Theta) = ? \]
for any viable \(\mathbf x\).
A generative model is one where we can answer questions about the structure of \(\mathbf x\)
Frequently under assumptions, but still can answer it!
Classic example - Logistic Regression vs. QDA
Logistic regression:
\[ P(y = 1 | \mathbf x, \hat{\bb}) = \sigma(\mathbf x^T \hat{\bb}) \]
\[ P(\mathbf x | \hat{\bb}) = ? \]
Logistic regression only models the conditional \(P(y | \mathbf x)\) - it says nothing about the distribution of \(\mathbf x\) itself, so there is no way to compute this quantity.
QDA:
Make an assumption that each \(\mathbf x\) conditional on its class label is a random draw from
\[ \mathbf x | y = c \sim \mathcal N_P(\mathbf x | \boldsymbol \mu_c , \boldsymbol \Sigma_c) \]
We can also marginalize over the class label by placing a prior on \(y\):
\[P(\mathbf x | \boldsymbol \mu , \boldsymbol \Sigma) = \sum \limits_{c = 1}^C \mathcal N_P(\mathbf x | \boldsymbol \mu_c , \boldsymbol \Sigma_c) P(y = c)\]
where the prior is a vector of probabilities (summing to 1) over the possible class labels.
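The mixture density above can be evaluated directly. A sketch with hypothetical two-class parameters (the means, covariances, and prior below are invented for illustration):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density evaluated at a single point x."""
    P = len(mu)
    diff = x - mu
    return ((2 * np.pi) ** (-P / 2)
            * np.linalg.det(Sigma) ** (-0.5)
            * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)))

# Hypothetical QDA parameters for two classes in 2 dimensions.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
prior = np.array([0.6, 0.4])              # P(y = c), sums to 1

def marginal_density(x):
    """P(x) = sum_c N(x | mu_c, Sigma_c) P(y = c)."""
    return sum(p * mvn_pdf(x, m, S) for p, m, S in zip(prior, mus, Sigmas))
```

A point near a class mean gets high marginal density; a point far from both classes gets nearly none - which is exactly the question a non-generative classifier cannot answer.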
In non-generative models, we cannot address the probability with which we would see a given feature vector!
In fully generative models, we give enough information to address any probability we want. A generative classifier:
\[ P(\mathbf x | \boldsymbol \Theta) = \int P(\mathbf x | \boldsymbol \Theta , y) P(y) dy \]
\[ P(y | \mathbf x , \boldsymbol \Theta) = \frac{P(\mathbf x | y , \boldsymbol \Theta) P(y)}{P(\mathbf x | \boldsymbol \Theta)} \]
The cost is assumptions about structure of \(\mathbf x\)
If that’s our goal, though, a necessary choice (no free lunch!)
PCA:
\[ \mathbf z = \mathbf W^T \mathbf x \]
\[ \mathbf x = \mathbf W \mathbf z \]
where \(\mathbf W\) is a \(P \times K\) weight matrix with \(K << P\).
Optimal solution under squared reconstruction error is \(\mathbf W = \mathbf Q_K\) where \(\mathbf Q_K\) is the first \(K\) eigenvectors of the covariance matrix
\[ \mathbf X^T \mathbf X = \mathbf Q \mathbf D \mathbf Q^{-1} \]
with \(\mathbf D\) being a diagonal matrix with the eigenvalues sorted from largest to smallest.
Generative goal:
\[ P(\mathbf x | \mathbf W) = \int P(\mathbf x | \mathbf W , \mathbf z) P(\mathbf z) d \mathbf z \]
For PCA, all we know is that \(\mathbf x = \mathbf W \mathbf z\) and that \(\mathbf z = \mathbf W^T \mathbf x\)
DOES NOT COMPUTE!!!
Solution: Assumptions. For centered \(\mathbf x\):
\[ P(\mathbf x | \mathbf W , \boldsymbol \Psi, \mathbf z) = \mathcal N_P(\mathbf x | \mathbf W \mathbf z, \boldsymbol \Psi) \]
Each \(\mathbf x\) is a random draw from a multivariate normal distribution centered on the linear map \(\mathbf W \mathbf z\) of a latent vector \(\mathbf z\), with covariance matrix equal to \(\boldsymbol \Psi\)
Similar to PCA, just saying that we have some uncertainty in the mapping of \(\mathbf z \rightarrow \mathbf x\)
\[ P(\mathbf z) = \mathcal N_K(\mathbf z | 0 , \mathcal I_K) \]
Prior to seeing any data, we believe that each \(\mathbf z\) is a random draw from a standard multivariate normal
Inconsequential choice since we’re going to learn \(\mathbf z\) anyways
Our goal: Find values of \(\mathbf W\) and \(\boldsymbol \Psi\) that maximize the likelihood with which we would observe our input features given the parameters.
\[ \{ \hat{\mathbf W} , \hat{\boldsymbol \Psi}\} = \underset{\mathbf W , \boldsymbol \Psi}{\text{argmax }} \prod \limits_{i = 1}^N P(\mathbf x_i | \mathbf W , \boldsymbol \Psi) \]
But \(\mathbf x\) depends on the values of the latent variables, \(\mathbf z\), so:
\[ \{ \hat{\mathbf W} , \hat{\boldsymbol \Psi}\} = \underset{\mathbf W , \boldsymbol \Psi}{\text{argmax }} \prod \limits_{i = 1}^N \int \mathcal N_P (\mathbf x_i | \mathbf W \mathbf z , \boldsymbol \Psi) \mathcal N_K(\mathbf z | \mathbf 0 , \mathcal I_K) d\mathbf z \]
Previously, we’ve seen that MLE problems can be made simpler by optimizing the log-likelihood
\[ \{ \hat{\mathbf W} , \hat{\boldsymbol \Psi}\} = \underset{\mathbf W , \boldsymbol \Psi}{\text{argmax }} \sum \limits_{i = 1}^N \log \int \mathcal N_P (\mathbf x_i | \mathbf W \mathbf z , \boldsymbol \Psi) \mathcal N_K(\mathbf z | \mathbf 0 , \mathcal I_K) d\mathbf z \]
Unfortunately, we can’t push the log inside the integral!
Finding the derivative of a log-integral is not particularly easy…
Maximizing this expression is tricky!
What’s holding us back here is that we need to learn \(\mathbf z\) and integrate over it
It would be way easier if we knew \(\mathbf z\) beforehand
But, \(\mathbf z\) is latent, so we learn it as a function of the data!
The Gaussian factor model has many different solution methods:
Eigendecomposition (it’s equivalent to PCA with a light twist)
MLE via an expectation-maximization routine
We’re going to show a different method that will be extendable for a more general version of factor analysis!
Why Bayesian Methods?
Latent variables have a natural distributional structure
“Fills in the gaps” via prior assumptions
Everything is a distribution, so everything can be sampled
We’re going to spend some time covering the basics of Bayesian methods and link back to the most common fitting method for generative models for images - variational Bayes
Before we jump into complicated Bayesian methods, we need to understand how Bayesian methods work
Start with alternative solution methods for common problems:
IID samples from a normal distribution with known \(\sigma^2\)
Linear regression under MSE loss
Recall the setup for the linear regression likelihood approach:
\[Pr(\mathbf y | \mathbf X, \boldsymbol \beta , \sigma^2) = \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\]
Given the conditional distribution of the outcome given the features and our model parameters, choose the model parameters that maximize the likelihood with which we would observe the data.
The idea here is that there is some true value of \(\boldsymbol \beta\) that generates the data
Find the one that is most likely to yield the observed outcomes
An alternative way to view the problem:
Let \(\boldsymbol \theta = \{\boldsymbol \beta, \sigma^2\}\) be the collection of unknowns which we would like to learn from the data, \(\mathcal D\).
Prior to observing any data we have some belief about the values of these unknowns that can be represented via a proper probability density function
\[f(\boldsymbol \theta)\]
The goal then is to condition our prior beliefs on the data we’ve observed
\[f(\boldsymbol \theta | \mathcal D)\]
The rational way to do this is to leverage Bayes Theorem
\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathcal D)}\]
\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathcal D)}\]
Likelihood
\[f(\mathcal D | \boldsymbol \theta)\]
Given some values of the unknowns, what is the probability with which we would observe the data?
This is the same likelihood we’ve already seen!
For a linear regression problem:
\[f(\mathbf y | \mathbf X, \boldsymbol \beta, \sigma^2) = \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\]
\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathcal D)}\]
Prior
\[f(\boldsymbol \theta)\]
Prior to seeing any data, what are our beliefs about the values of the unknowns?
Must be a probability density function that integrates to 1
How do we choose a prior distribution?
Convenience/Achieve a specific purpose
Elicitation (actually use prior knowledge)
Low information
For linear regression, we’ll soon see that a convenient choice of prior for the coefficients follows a multivariate normal distribution
Let \(\mathbf z\) be a random vector of length \(P\). We believe that each value of this vector is drawn marginally from a univariate normal distribution. However, each element of \(\mathbf z\) may correlate with other values!
\[\mathbf z \sim \mathcal N_P(\mathbf z | \boldsymbol \mu , \boldsymbol \Sigma)\]
\[f(\mathbf z | \boldsymbol \mu, \boldsymbol \Sigma) = (2 \pi)^{-P/2} \text{det}\left( \boldsymbol \Sigma \right)^{-1/2} \exp \left[- \frac{1}{2} (\mathbf z - \boldsymbol \mu)^T \boldsymbol \Sigma^{-1} (\mathbf z - \boldsymbol \mu) \right]\]
\(\boldsymbol \mu\) is a \(P\)-vector of element means; what is the average draw of the \(j^{th}\) element of the vector?
\(\boldsymbol \Sigma\) is a \(P \times P\) covariance matrix. The diagonal elements tell us the variance of the different elements of \(\mathbf z\). The off-diagonals tell us the pairwise covariance between different elements of \(\mathbf z\).
A convenient prior is then:
\[\boldsymbol \beta \sim \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]
\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathcal D)}\]
Marginal Likelihood
\[f(\mathcal D) = \int \limits_{\boldsymbol \theta} f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta) d\boldsymbol \theta\]
What is the marginal probability with which we would expect to see our data?
Think of this as the probability with which we could see the data integrating over our prior
How consistent is the data we observe with our prior belief and our choice of model.
Higher values mean that holding the prior constant our model does a good job of describing the data.
The marginal likelihood is always a constant scalar value!
Since we’re integrating away the unknowns, there are no parts of the resulting formula that are random
Therefore, this will not end up being a distribution!
A useful way to compare models
However, this quantity can be quite difficult to compute.
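For the normal/normal model used later in this lecture, the marginal likelihood of a single observation happens to be available in closed form (\(y \sim \mathcal N(\mu_0, \sigma^2 + \sigma_0^2)\)), which lets us sanity-check the brute-force approach of averaging the likelihood over draws from the prior (all numeric values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

sigma, mu0, sigma0 = 1.0, 2.0, 1.5   # arbitrary known sd and prior parameters
y = 1.0                              # a single observed data point

def normal_pdf(y, mu, s2):
    return np.exp(-0.5 * (y - mu) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

# Monte Carlo: f(D) = integral of f(D | mu) f(mu) d mu, approximated by
# averaging the likelihood over draws from the prior.
mu_draws = rng.normal(mu0, sigma0, size=200_000)
mc_estimate = normal_pdf(y, mu_draws, sigma ** 2).mean()

# Closed form for this conjugate model: y ~ N(mu0, sigma^2 + sigma0^2).
analytic = normal_pdf(y, mu0, sigma ** 2 + sigma0 ** 2)
```

In models without conjugacy there is no closed form, and this integral is exactly the hard-to-compute quantity referred to above.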
\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathcal D)}\]
Posterior
\[f(\boldsymbol \theta | \mathcal D)\]
Since our prior is a probability distribution, our posterior beliefs also follow a probability distribution
The posterior is a proper probability distribution that encodes:
The probability, updated consistently with Bayes’ theorem, that any unknown takes on any given value
Our uncertainty about the values of the unknowns after observing data!
What is our uncertainty around maximum likelihood estimates?
Let \(\hat{\boldsymbol \theta}\) be a set of maximum likelihood estimates given a likelihood:
\[\hat{\boldsymbol \theta} = \underset{\boldsymbol \theta}{\text{argmax }} \prod \limits_{i = 1}^N Pr(\mathcal D_i | \boldsymbol \theta)\]
Under sufficient regularity conditions, when \(N \rightarrow \infty\) and appealing to CLT:
\[\hat{\boldsymbol \theta} \sim \mathcal N_P(\hat{\boldsymbol \theta} | \boldsymbol \theta , I_N(\boldsymbol \theta)^{-1})\]
where \(I_N(\boldsymbol \theta)\) is the Fisher information:
\[I_N(\boldsymbol \theta) = -E_{\boldsymbol \theta}\left[\frac{d^2}{d \boldsymbol \theta^2} \sum \limits_{i = 1}^N \log Pr(\mathcal D_i | \boldsymbol \theta)\right]\]
In words, the uncertainty around our estimates is asymptotically normal
All of the caveats that come from frequentist statistics
Large Sample theory
Bayesian methods yield uncertainty as a function of the prior and the data
Native to the posterior distribution!
We’ll see this via example
Before jumping into regression, let’s look at a simpler example that demonstrates the Bayesian estimation process.
Example: Normal likelihood, known variance
Suppose that we observe \(N\) instances of \(y\) that we believe are i.i.d. draws from a normal distribution with a known value for the variance. \(\mu\) - the mean of the normal - is unknown.
Likelihood:
\[f(\mathbf y | \mu , \sigma^2) = \prod \limits_{i = 1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right]\]
Our prior distribution over the unknown, \(\mu\), will be a normal distribution:
\[f(\mu) \sim \mathcal N(\mu | \mu_0 , \sigma^2_0)\]
with some arbitrary mean and variance.
We can define our posterior distribution as:
\[f(\mu | \mathbf y) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)}{\int \limits_{-\infty}^{\infty} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0) d\mu}\]
This is our posterior distribution!
It’s just not in a useful form
What is the mode of the posterior?
What is the variance of the posterior?
How do we make posteriors useful?
Four strategies:
Analytically find an easy to work with form for the posterior
Find the posterior mode and apply a Laplace approximation
Take a large number of random draws from the posterior (Not covered in this class, see Gibbs and Metropolis approaches)
Find an approximation that maximizes the marginal likelihood
Sometimes, posterior distributions can be analytically simplified
Let’s show an example of this.
\[f(\mu | \mathbf y) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)}{\int \limits_{-\infty}^{\infty} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0) d\mu}\]
First simplification:
The denominator is a constant that is not random!
We’re integrating over \(\mu\) and everything else is known!
Its purpose is to ensure that the posterior integrates to 1.
\[f(\mu | \mathbf y) = \frac{1}{Z} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)\]
\[f(\mu | \mathbf y) = \frac{1}{Z} \prod \limits_{i = 1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right] \frac{1}{\sqrt{2 \pi \sigma_0^2}} \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]
Simplification #2:
We only care about the random terms
e.g. terms that have to do with \(\mu\)
Anything else can be moved to the front constant
\[f(\mu | \mathbf y) = C \prod \limits_{i = 1}^N \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right] \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]
The product of exponentials is the exponential of the sum of the exponents:
\[f(\mu | \mathbf y) = C \exp \left[ - \frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mu)^2 \right] \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]
Folding the prior’s exponent into the same exponential:
\[f(\mu | \mathbf y) = C \exp \left[ - \frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mu)^2 - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]
Expanding the squares and distributing the sum:
\[f(\mu | \mathbf y) \propto \exp \left[ - \frac{1}{2 \sigma^2} \left(\sum y_i^2 - 2 \mu \sum y_i + N \mu^2 \right) - \frac{1}{2 \sigma_0^2} \left(\mu^2 - 2 \mu \mu_0 + \mu_0^2 \right) \right]\]
Combining like terms:
\[f(\mu | \mathbf y) \propto \exp \left[- \frac{1}{2} \left(\frac{1}{\sigma^2} \sum y_i^2 + \frac{1}{\sigma^2_0} \mu_0^2 - 2 \mu \left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) + \mu^2 \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right) \right) \right]\]
Now, take advantage of the fact that we can get rid of any multiplicative terms that don’t have to do with \(\mu\)
\[f(\mu | \mathbf y) = C' \exp \left[ - \frac{1}{2} \left(- 2 \mu \left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) + \mu^2 \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right) \right) \right]\]
This looks a lot like the kernel of a normal distribution:
\[C \exp\left[-\frac{1}{2 s^2}(z - m)^2 \right] = C' \exp\left[-\frac{1}{2 s^2}(z^2 - 2 z m) \right]\]
Can we put this in the form of the Gaussian kernel?
The Gaussian identity for random variable \(y\): \[C' \exp\left[-\frac{1}{2}(ay^2 - 2 b y) \right]\]
is equivalent to a normal distribution (up to a multiplicative constant) where:
\[\hat{\sigma}^2 = a^{-1}\]
\[\hat{\mu} = a^{-1}b\]
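This completing-the-square identity is easy to verify numerically: normalize the kernel on a fine grid (the values of \(a\) and \(b\) below are arbitrary) and check that its mean and variance come out to \(a^{-1}b\) and \(a^{-1}\):

```python
import numpy as np

a, b = 2.0, 3.0                          # arbitrary values for the check
y = np.linspace(-10.0, 15.0, 200_001)
dy = y[1] - y[0]

kernel = np.exp(-0.5 * (a * y ** 2 - 2 * b * y))
density = kernel / (kernel.sum() * dy)   # normalize on the grid

mean = (y * density).sum() * dy          # should be b / a
var = ((y - mean) ** 2 * density).sum() * dy   # should be 1 / a
```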
This means that our posterior is a normal distribution:
\[f(\mu | \mathbf y) \sim \mathcal N(\mu | \hat{\mu} , \hat{\sigma}^2)\]
where
\[\hat{\sigma}^2 = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\]
and
\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right)\]
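A quick numeric sketch of this update on simulated data (all parameter values are invented), including the diffuse-prior limit discussed next:

```python
import numpy as np

rng = np.random.default_rng(6)

sigma, mu0, sigma0 = 2.0, 0.0, 1.0     # known sd, prior mean, prior sd
y = rng.normal(3.0, sigma, size=50)    # data simulated with true mean 3
N = len(y)

# Conjugate normal/normal update.
post_var = 1.0 / (N / sigma ** 2 + 1.0 / sigma0 ** 2)
post_mean = post_var * (y.sum() / sigma ** 2 + mu0 / sigma0 ** 2)

# With an (effectively) infinite-variance prior, the posterior moments
# collapse to the maximum likelihood answers.
diffuse_var = 1.0 / (N / sigma ** 2 + 1.0 / 1e8)
diffuse_mean = diffuse_var * (y.sum() / sigma ** 2 + mu0 / 1e8)
```

The posterior mean lands between the prior mean (0 here) and the sample mean, pulled toward the data as \(N\) grows.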
That’s convenient that we ended up with a normal posterior distribution!
It’s not a coincidence
Conjugate Prior
For certain likelihood forms, the posterior is of the form of the prior
Normal/Normal
Beta/Bernoulli
Multivariate Normal/Multivariate Normal
Prior form is chosen out of convenience!
Interesting observation:
\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right)\]
If the prior variance is small (meaning a really precise prior or we have a strong belief about the location of \(\mu\)), the posterior mean will be really close to \(\mu_0\)
The numerator will be dominated by the value of the prior mean
Interesting observation:
What happens if we have a prior with infinite variance?
\[\hat{\sigma}^2 = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1} = \frac{\sigma^2}{N}\]
and
\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) = \frac{\sigma^2}{N} \frac{\sum y_i}{\sigma^2} = \frac{\sum y_i}{N}\]
The moments of the posterior for the mean become the maximum likelihood answers: the sample average, with variance equal to the squared standard error \(\sigma^2/N\).
We can think of MLE as a special case of Bayesian methods where we have diffuse priors
Another interesting result:
As \(N \to \infty\) with a non-infinite variance prior
\[\hat{\sigma}^2 \approx \frac{\sigma^2}{N}\]
\[\hat{\mu} \approx \frac{\sum y_i}{N}\]
At a certain point, our prior doesn’t matter!
We can see MLE as the asymptotic limit of the posterior distribution as \(N\) gets large
How is this useful for us?
Let’s now look at the linear regression problem. We want to find the posterior over the regression coefficients given the data we observe and assuming a known variance:
\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2)\]
The likelihood:
\[f(\mathbf y | \mathbf X , \boldsymbol \beta , \sigma^2) = \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\]
Because this is a product of independent normals, we can represent this via an \(N\)-dimensional multivariate normal distribution:
\[f(\mathbf y | \mathbf X , \boldsymbol \beta , \sigma^2) \sim \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N)\]
The prior:
Let’s say that the prior over the \(P\) coefficients follows a multivariate normal distribution with mean \(\boldsymbol \mu_0\) and covariance matrix \(\boldsymbol \Sigma_0\)
\[f(\boldsymbol \beta) \sim \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]
This means that our posterior is:
\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) = \frac{1}{Z} \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N) \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]
What we know: a multivariate normal likelihood paired with a multivariate normal prior is another conjugate pair. This means that:
\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) \sim \mathcal N_P(\boldsymbol \beta | \hat{\boldsymbol \mu} , \hat{\boldsymbol \Sigma})\]
The proof for this is quite long, but closely parallels the normal/normal derivation above. It is outlined in all its glory in the lecture notes for Bayesian Regression included on the Canvas site.
The punchline:
\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) \sim \mathcal N_P(\boldsymbol \beta | \hat{\boldsymbol \mu} , \hat{\boldsymbol \Sigma})\]
\[\hat{\boldsymbol \Sigma} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \boldsymbol \Sigma_0^{-1} \right)^{-1}\]
\[\hat{\boldsymbol \mu} = \hat{\boldsymbol \Sigma}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y + \boldsymbol \Sigma^{-1}_0 \boldsymbol \mu_0 \right)\]
Implication 1:
When all of the prior variances are diffuse
\[\hat{\boldsymbol \mu} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X \right)^{-1}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y \right) = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y\]
OLS is the solution when we have no prior info!
This also holds when the number of training samples goes to \(\infty\)
Implication 2:
Let’s set \(\boldsymbol \mu_0 = \mathbf 0\) and \(\boldsymbol \Sigma_0 = \tau^2 \mathcal I_P\)
This means that:
\[\hat{\boldsymbol \Sigma} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \frac{1}{\tau^2} \mathcal I_P \right)^{-1}\]
\[\hat{\boldsymbol \mu} = \hat{\boldsymbol \Sigma}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y \right)\]
(the prior-mean term drops out because \(\boldsymbol \mu_0 = \mathbf 0\))
Let \(\lambda = \frac{\sigma^2}{\tau^2}\). We can rearrange the posterior mean to be:
\[\hat{\boldsymbol \mu} = (\mathbf X^T \mathbf X + \lambda \mathcal I_P)^{-1} \mathbf X^T \mathbf y\]
which is the analytical solution for Ridge regression
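The equivalence can be confirmed numerically on simulated data (the coefficients, \(\sigma\), and \(\tau\) below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
N, P = 100, 4

X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
sigma, tau = 1.0, 0.5
y = X @ beta_true + rng.normal(0.0, sigma, size=N)

# Bayesian posterior mean under the prior beta ~ N(0, tau^2 I).
Sigma_hat = np.linalg.inv(X.T @ X / sigma ** 2 + np.eye(P) / tau ** 2)
mu_hat = Sigma_hat @ (X.T @ y / sigma ** 2)

# Ridge solution with lambda = sigma^2 / tau^2.
lam = sigma ** 2 / tau ** 2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)
```

The two vectors agree to numerical precision: the ridge penalty is just the prior precision relative to the noise precision.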
Regularization with the L2 norm is Bayesian regression with a normal prior on the coefficients centered at 0!
Regularization methods are us setting up desired or believed values for the regression coefficients
The L2 penalty (therefore weight decay) is a Bayesian solution to the generalization problem
The strength of this regularization is then the level at which we want to simplify our results
The Bayesian approach also justifies letting the model be more complex as \(N\) grows
I find this to be a satisfying justification for regularization!
MAP approximations, Marginal Likelihood maximization, and Variational Bayes